Different soil conditions give rise to different bacteria. This review aims to analyze NEON data in order to establish a more detailed understanding of a specific site and specific bacteria phylum. We discuss the different environments and conditions found at our site and explore the taxonomic breakdown of the site. We also break down our phylum into lower taxonomic groups and analyze what conditions these data points are found in. We highlight interesting findings concerning the relationship between soil pH and nutrient content and genome size. Ultimately this review aims to visualize NEON data and highlight interesting findings within our visualizations.
The Onaqui field site (ONAQ), managed by the Bureau of Land Management and located in Tooele County, Utah, is a significant terrestrial research area within the National Ecological Observatory Network (NEON), roughly 80 km southwest of Salt Lake City. Encompassing 68 km² of the Great Basin’s characteristic sagebrush and juniper landscape, ONAQ operates within the arid, warm climate of the Great Basin Domain (D15), flanked by major American geographic features. The site’s geology, dominated by Paleozoic deposits, supports diverse soil types and a unique hydrology within the Rush-Tooele Valleys Watershed. Vegetation primarily consists of Big Sagebrush, with areas of Utah Juniper and Pinyon Pine, providing habitat for a variety of fauna, including pronghorn, coyotes, and various small mammals. Historical uses of the land reflect a rich tapestry of indigenous habitation, agricultural development, and conflict, leading to its current multi-use management focused on conservation, recreation, and grazing. (Holthuijzen and Veblen 2015) (Baxter 2018)
Methylomirabilota, also known as the NC10 Phylum, is a bacterial phylum. These microbes possess a unique metabolic capability to perform anaerobic methane oxidation coupled to nitrite reduction (ANME-2d), a process known as anaerobic methane oxidation (AOM). This metabolic trait is distinctive because it allows them to thrive in environments with low to no oxygen availability, such as certain freshwater and marine sediments, where methane and nitrite coexist. (He et al. 2016)
In this report, bacteria in the Methylomirabilota Phylum as well as organisms located at the Onaqui field site will be explored and patterns in both groups will be interpreted.
NEON used multiple data collecting methods to receive the samples for each site as well as phylum. For each site the data was collected to report the weather, climate, land cover and species within the ecosystem. The 3 methods that were used for data collection was Airborne Remote Sensing, Automated Instruments and observational sampling. The Airborne Remote sensing used spectrometers, digital cameras, lidar, GPS and Inertial Measurement unit in order to observe data. The automated instruments were used to collect soil, surface water and groundwater to examine patterns as well as the bacteria found in these locations. Finally, observations were split into aquatic observations and terrestrial observations in which species diversity and environmental or chemical properties could be examined. The data was presented on the NEON website in which we were able to retrieve the data for our corresponding site and phylum. This specific data was then translated into a csv file so that it could successfully be imported into Rmarkdown. This is when the data could then be configured to represent different graphs and tables to present the data. This was done by asking site or phylum specific questions with respect to the data. The questions asked during this process are mentioned below.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_edArchaea.csv") %>%
# remove columns that are not needed for data analysis
select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`)) %>%
# create a new column with the Assembly Type
mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
TRUE ~ "Individual")) %>%
mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>%
separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), "; ", remove = FALSE) %>%
# Get rid of the the common string "Soil microbial communities from "
mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
# Use the first `-` to split the column in two
separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
# Get rid of the the common string "S-comp-1"
mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
# separate the Sample Name into Site ID and plot info
separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE,) %>%
# separate the plot info into 3 columns
separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-") ## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date (1): Date Added
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 6 pieces. Additional pieces discarded in 46 rows [3, 4, 24, 25, 26,
## 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 54, 232, 267, ...].
## Warning: Expected 6 pieces. Missing pieces filled with `NA` in 446 rows [1, 2, 9, 10,
## 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 46, 50, 53, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [4, 7, 8, 236,
## 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252,
## ...].
NEON_MAGs_bact_ind <- NEON_MAGs %>%
filter(Domain == "Bacteria") %>%
filter(`Assembly Type` == "Individual") NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>%
rename(`Genome Name` = `Genome Name / Sample Name`) %>%
filter(str_detect(`Genome Name`, 're-annotation', negate = T)) %>%
filter(str_detect(`Genome Name`, 'WREF plot', negate = T)) ## Rows: 176 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (18): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, ...
## dbl (16): taxon_oid, IMG Genome ID, Depth In Meters, Elevation In Meters, Ge...
## lgl (12): Altitude In Meters, Chlorophyll Concentration, Longhurst Code, Lon...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_metagenomes <- NEON_metagenomes %>%
# Get rid of the the common string "Soil microbial communities from "
mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
# Use the first `-` to split the column in two
separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
# Get rid of the the common string "-comp-1"
mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
# separate the Sample Name into Site ID and plot info
separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE,) %>%
# separate the plot info into 3 columns
separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-") ## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [53].
NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>%
# remove -COMP from genomicsSampleID
mutate_at("genomicsSampleID", str_replace, "-COMP", "") ## Rows: 87 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): genomicsSampleID, siteID, plotID, nlcdClass, horizon
## dbl (11): decimalLatitude, decimalLongitude, elevation, soilTemp, d15N, org...
## date (1): collectionDate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
kable(
NEON_chemistry_description <- read_tsv("data/NEON/neon_soilChem1_metadata_descriptions.tsv")
)## Rows: 23 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): fieldName, description, dataType, units
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| fieldName | description | dataType | units |
|---|---|---|---|
| siteID | NEON site code | string | NA |
| plotID | Plot identifier (NEON site code_XXX) | string | NA |
| sampleID | Identifier for sample | string | NA |
| horizon | Organic or mineral soil | string | NA |
| genomicsSampleID | Identifier for a genomics sample | string | NA |
| d15N | Measure of the ratio of 15N:14N in a sample, relative to atmospheric N2 | real | permill |
| organicd13C | Measure of the ratio of 13C:12C in soil organic carbon, relative to Vienna Pee Dee Belemnite | real | permill |
| nitrogenPercent | Percent nitrogen in a sample on a dry weight basis | real | percent |
| organicCPercent | Percent organic carbon in a sample on a dry weight basis | real | percent |
| CNratio | Ratio of carbon to nitrogen concentration in a sample on a dry weight basis | real | NA |
| nlcdClass | National Land Cover Database Vegetation Type Name | string | NA |
| subplotID | Identifier for the NEON subplot | string | NA |
| coreCoordinateX | x location of the soil core relative to the SW corner | real | meter |
| coreCoordinateY | y location of the soil core relative to the SW corner | real | meter |
| decimalLatitude | The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area | real | decimalDegree |
| decimalLongitude | The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area | real | decimalDegree |
| elevation | Elevation (in meters) above sea level | real | meter |
| sampleTiming | Timing of the sampling event with regard to the field season | string | NA |
| soilTemp | In-situ temperature of soil at approximately 10 cm depth | real | degree |
| sampleTopDepth | Depth below the soil surface of the top of a soil sample | real | centimeter |
| sampleBottomDepth | Depth below the soil surface of the bottom of a soil sample | real | centimeter |
| soilInWaterpH | pH value of soil measured in water solution | real | pH |
| soilInCaClpH | pH value of soil measured in calcium chloride solution | real | pH |
As mentioned before, Methylomirabilota, or NC10, is a bacterial phylum. Within the NEON database, there are 23 collections of bacteria that belong to the Methylomirabilota phylum. Out of the 23 collections, 16 collections were collected by the Individual Assembly while 7 collections were from the NEON combined assembly. This lab will mostly focus on the 16 collections from the Individual Assembly.
As it can be seen within the Individual Assembly collections below, all 16 out of the 16 collections belong to the Class Methylomirabilia. However, when going deeper down into the taxonomy, 15 out of the 16 collections belong to the Order Rokubacteriales while 1 collection belongs to the Order Methylomirabilales.
Looking at the breakdown of the families within the phylum, it has a similar pattern to the Order where 15 out of the 16 collections belong to the Family CSP1-6 while 1 collection belongs to the Order Methylmirabilales.
However, when looking at the Genus breakdown within the phylum, there are 5 genuses and 4 sets of novel bacterial genuses found with the Genus AR12 being the most present within the phylum.
Outside of the taxonomic breakdown, it can be seen that the Methylomirabilota Phylum was collected at 5 different NEON sites, with a majority of the collections at the National Grasslands LBJ and Konza Prairie Bio Station sites.
NEON_MAGs_methyl %>%
ggplot(aes(x = fct_rev(fct_infreq(Site)), fill = `Order`)) +
geom_bar() +
coord_flip() +
labs(title = "Order at each Site", y = "Count", x = "Site")
#
NEON_MAGs_methyl %>%
ggplot(aes(x = fct_rev(fct_infreq(Site)), fill = `Class`)) +
geom_bar() +
coord_flip() +
labs(title = "Order at each Site", y = "Count", x = "Site") NEON_MAGs_methyl %>%
ggplot(aes(x = fct_rev(fct_infreq(Site)), fill = `Family`)) +
geom_bar() +
coord_flip() +
labs(title = "Family at each Site", y = "Count", x = "Site") NEON_MAGs_methyl %>%
ggplot(aes(x = fct_rev(fct_infreq(Site)), fill = `Genus`)) +
geom_bar() +
coord_flip() +
labs(title = "Genus at each Site", y = "Count", x = "Site") table_9<-NEON_metagenomes %>%
full_join(NEON_MAGs, by = "Sample Name")%>%
full_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID"))%>%
filter(str_detect(`GTDB-Tk Taxonomy Lineage`,"Methylomirabilota"))table_10<-NEON_metagenomes %>%
full_join(NEON_MAGs, by = "Sample Name")%>%
full_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID"))table_10%>%
ggplot(aes(x=`Geographic Location`, fill=Phylum))+
geom_bar()+
theme(axis.text.x = element_text(angle=90, vjust=1, hjust=1))+
labs(title = "Geographic Location for Methylomirabiota") Part of the NEON data is taken via soil samples. These soil samples are analyzed to determine which bacteria are present. We wanted to look at the taxonomic breakdown of bacteria at the site starting with Phylum and moving down to Family. When looking at the Phylums present, Actinobacteriota make up the majority of the bacteria present. In terms of classes present, the 4 biggest classes at the site are from this phylum. These 4 classes makeup the Actinobacteria phylum. When breaking down the taxonomy further, there are more orders within the phylum. At the family level a large portion of the Actinobacteria phylum lacks data regarding what family they belong to. When looking at specific subplots it can be seen that each subplot has a similar phylum and class makeup as the site data indicating that the taxonomic breakdown of the site is pretty uniform across the site.
NEON_utah %>%
ggplot(aes(x = fct_infreq(Phylum)))+
geom_bar(aes(color = Phylum), position = position_dodge2(width = 0.9, preserve = "single"))+
coord_flip()+
labs(title="Phylum count at Onaqui", x ="Phylum", y="Count")NEON_utah %>%
ggplot(aes(x = fct_infreq(Class)))+
geom_bar(aes(color = Phylum), position = position_dodge2(width = 0.9, preserve = "single"))+
coord_flip()+
labs(title="Class count at Onaqui", x ="Phylum", y="Count")NEON_utah %>%
ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = `Class`)) +
geom_bar() +
coord_flip() +
labs(title = "Class at each Phylum", y = "Count", x = "Phylum")NEON_utah %>%
ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = `Order`)) +
geom_bar() +
coord_flip() +
labs(title = "Order in Each Phylum", y = "Count", x = "Phylum")NEON_utah %>%
ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = `Family`)) +
geom_bar() +
coord_flip() +
labs(title = "Family in Each Phylum", y = "Count", x = "Phylum")NEON_utah %>%
ggplot(aes(x = fct_rev(fct_infreq(Subplot)), fill = `Phylum`)) +
geom_bar() +
coord_flip() +
labs(title = "Subplot MAG Counts", y = "Count", x = "Subplot") NEON_utah %>%
ggplot(aes(x = fct_rev(fct_infreq(Subplot)), fill = `Class`)) +
geom_bar() +
coord_flip() +
labs(title = "Class Makeup at Each Subplot", y = "Count", x = "Subplot")Most bacteria that are similar to each other are usually found in similar environments. It can be said the same for Methylomirobiota. As it can be seen, most of the bacteria were found in a Grasslands subtype at temperatures greater than 17 degrees Celsius. Additionally, the 1 collection in the Order Methylomirabilales was found in a wetlands subtype at a temperature of 15 degrees Celsius.
table_9%>%
ggplot(aes(x=fct_infreq(`Ecosystem Subtype`), y = soilTemp, color = Order))+
geom_point()+
coord_flip()+
labs(title = "Ecosystem Subtype vs. Soil Temperature", x = "Ecosytem Subtype", y = "Soil Temperature") ## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
When analyzing the Soil pH in which the bacteria were found, it was found that 1 collection in the Order Methylomirabilales was at a pH of 7, which means that the soil was not acidic or basic, but neutral. Moreover, 2 collections in the Order Rokubacteriales found at the temperate forest areas were at pHs below 5, which means that the soil were acidic. Finally, 11 collections in the Order Rokubacteriales found at the grassland areas were at pHs that ranged from 4.5 to 7, which means that the soils at those areas were slightly acidic to neutral.
table_9%>%
ggplot(aes(x=soilInCaClpH, y = `Ecosystem Subtype`, color = Family))+
geom_point()+
labs(title = "Soil pH", x = "CaCl Soil pH", y = "Ecosystem Subtype") ## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
In the Methylomirobiota phylum, collections were usually found at elevations under 1000 meters; however, one of the collections found in the temperate forest subtype was found at an elevation above 3000 meters. Additionally, the total number of bases as well as the total number of tRNA genes were not restricted to a small range. In fact, they were spread out and did not provide any unique properties in the phylum itself. Moreover, as seen below, soil temperature did not have any type of correlation with the total number of bases or the total number of tRNA genes. This meant that the environment did not influence the genetic sizes or gene amounts within the phylum itself.
table_9%>%
ggplot(aes(x=elevation, y=soilTemp, color=`Ecosystem Subtype`))+
geom_point() +
labs(title = "Elevation vs. Soil Temperature", x = "Elevation", y = "Soil Temperature")## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_9%>%
ggplot(aes(x=soilTemp, y=`Total Number of Bases`, color=Family))+
geom_point()+
labs(title = "Soil Temperature and Total Number of Bases", x = "Soil Temperature", y = "Total Number of Bases") ## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_9%>%
ggplot(aes(x=soilTemp, y=`tRNA Genes`, color=`Ecosystem Subtype`))+
geom_point()+
labs(title = "Soil Temperature and Total Number of tRNA Genes", x = "Soil Temperature", y = "tRNA Gene Count") ## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_9%>%
ggplot(aes(x=elevation, y=`Total Number of Bases`, color=`Ecosystem Subtype`))+
geom_point()+
labs(title = "Elevation and Total Number of Bases", x = "Elevation", y = "Total Number of Bases") ## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_9%>%
ggplot(aes(x=elevation, y=`tRNA Genes`, color=Genus))+
geom_point()+
labs(title = "Elevation and tRNA Gene Count", x = "Elevation", y = "tRNA Gene Count")## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
In looking at our site specific data we wanted to make comparisons across multiple sites. We looked primarily at two factors: soil temperature and soil pH. When compared to other sites, our site fell within the middle range of soil temperatures, ranging from 10 to 20 degrees celsius. The only sites that fell out of this middle range were the sites in Alaska with colder temperatures and the sites in Arizona and Puerto Rico with hotter temperatures. When comparing soil pH at sites, there was less data, but some interesting findings. Onaqui had higher pH than other sites, ranging from 8-9. The only site that had data points within this range was in Tuscon, Arizona, and even then there were data points with lower pH.
table_10<-NEON_metagenomes %>%
full_join(NEON_MAGs, by = "Sample Name")%>%
full_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID"))table_10%>%
ggplot(aes(x=fct_infreq(`Geographic Location`), y=soilTemp))+
geom_boxplot()+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))+
labs(title = "Site vs. Soil Temperature", x = "Geographic Location", y = "Soil Temperature") ## Warning: Removed 625 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
table_10%>%
ggplot(aes(x=Site.x, y=soilInWaterpH, color=Phylum))+
geom_point()+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))+
labs(title = "Soil in Water pH at Sites", x = "Site", y = "Soil pH") ## Warning: Removed 624 rows containing missing values or values outside the scale range
## (`geom_point()`).
Research was conducted to determine how different the sites are in relation to their soil temperature and if Onaqui’s temperature was on an extreme. As seen below, the soil temperature found at Onaqui ranged at 11 to 16 degrees Celsius with its median temperature being 14.5 degrees Celsius. Compared to the other sites, Onaqui’s soil temperature was not found to be at an extreme, but a moderate temperature. This allows us to see that temperature was not a factor in bacterial collections.
table_10%>%
ggplot(aes(x=Site.x, y=soilTemp, color=Phylum))+
geom_point()+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))+
labs(title = "Soil Temperature at Sites", x = "Site", y = "Soil Temp") ## Warning: Removed 625 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_10%>%
ggplot(aes(x=fct_infreq(`Geographic Location`), y=soilTemp))+
geom_boxplot()+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))+
labs(title = "Site vs. Soil Temperature", x = "Geographic Location", y = "Soil Temperature") ## Warning: Removed 625 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Next we wanted to look at the role soil environment plays in overall taxonomy. When looking at soil temperature, most Phylums fall within the 10-20 degree celsius range. Actinobacteriota was mostly found in 15-20 degree celsius range. Methylomirabilota was found at similar temperatures, but had much less data points outside this range. In terms of pH, most phylums fell within the 5-7 range. Actinobacteria was found mainly at pHs of 5.5-7.5. Methylomirabilota ranged mainly from 6-7.
table_10%>%
ggplot(aes(x=soilTemp, y=soilInWaterpH, color=Phylum))+
geom_point()+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))+
labs(title = "Soil Temperature and pH at Sites", x = "Soil Temperature", y = "Soil pH") ## Warning: Removed 625 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_10%>%
ggplot(aes(x=Phylum, y=soilTemp)) +
geom_jitter(alpha = 0.3, color = "tomato")+
geom_violin() +
theme(axis.text.x = element_text(angle=90, vjust=1, hjust=1))+
labs(title = "Soil Temperature for Phylums", x = "Phylums", y = "Soil Temp")## Warning: Removed 625 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Warning: Removed 625 rows containing missing values or values outside the scale range
## (`geom_point()`).
There are multiple factors that affect soil. Some explored in the graphs below include soil temperature, soil pH, elevation, nitrogen percent, and organic carbon percentage. Given these factors, one can reasonably assume that one of these might affect tRNA counts in the organisms found in the collections. However, the graphs below show that there is little to no correlation among these factors and its effect on the tRNA count of the organisms. This means that the soil in which the organisms survive is not influential to its tRNA amount.
table_10%>%
ggplot(aes(x= `tRNA Genes`, y = soilTemp))+
geom_point()+
theme(axis.text.x = element_text(angle=90, vjust=1, hjust=1))+
labs(title = "tRNA gene count vs. soil temp")## Warning: Removed 627 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_10%>%
ggplot(aes(x= `tRNA Genes`, y = soilInCaClpH))+
geom_point()+
theme(axis.text.x = element_text(angle=90, vjust=1, hjust=1))+
labs(title = "tRNA gene count vs. pH") ## Warning: Removed 626 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_10%>%
ggplot(aes(x= `tRNA Genes`, y = elevation, color=`Ecosystem Subtype`))+
geom_point()+
theme(axis.text.x = element_text(angle=90, vjust=1, hjust=1))+
labs(title = "tRNA gene count vs. elevation") ## Warning: Removed 626 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_10%>%
ggplot(aes(x= `tRNA Genes`, y = nitrogenPercent, color=`Ecosystem Subtype`))+
geom_point()+
theme(axis.text.x = element_text(angle=90, vjust=1, hjust=1))+
labs(title = "tRNA gene count vs. nitrogen") ## Warning: Removed 1402 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_10%>%
ggplot(aes(x= `tRNA Genes`, y = organicCPercent, color = `Ecosystem Subtype`))+
geom_point(alpha=0.6)+
theme(axis.text.x = element_text(angle=90, vjust=1, hjust=1))+
labs(title = "tRNA gene count vs. organic C") ## Warning: Removed 1402 rows containing missing values or values outside the scale range
## (`geom_point()`).
table_10%>%
ggplot(aes(x=fct_infreq(`Phylum`), y=`tRNA Genes`))+
geom_boxplot()+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))+
labs(title = "Site vs. Soil Temperature", x = "Geographic Location", y = "Soil Temperature") ## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
We wanted to visualize both the tree visualization of our phylum and at our site. At our site a few interesting things were found. For the most part genome size is relatively similar between different genus within the same family. One data point in particular stood out. The sample taken from the wetland shows that the species isn’t closely related to any of the other species within the phylum. This is interesting as it is the only species found in the wetland ecosystem amongst NEON samples.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("treeio")## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
## CRAN: http://rspm/default/__linux__/focal/latest
## Bioconductor version 3.19 (BiocManager 1.30.23), R 4.4.0 (2024-04-24)
## Installing package(s) 'BiocVersion', 'treeio'
## Installation paths not writeable, unable to update packages
## path: /opt/R/4.4.0/lib/R/library
## packages:
## survival
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
## CRAN: http://rspm/default/__linux__/focal/latest
## Bioconductor version 3.19 (BiocManager 1.30.23), R 4.4.0 (2024-04-24)
## Installing package(s) 'ggtreeExtra'
## also installing the dependencies 'patchwork', 'aplot', 'ggtree'
## Installation paths not writeable, unable to update packages
## path: /opt/R/4.4.0/lib/R/library
## packages:
## survival
## ggtree v3.12.0 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
##
## Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
## Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
##
## G Yu. Data Integration, Manipulation and Visualization of Phylogenetic
## Trees (1st ed.). Chapman and Hall/CRC. 2022. ISBN: 9781032233574
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:tidyr':
##
## expand
library(TDbook) #A Companion Package for the Book "Data Integration, Manipulation and Visualization of Phylogenetic Trees" by Guangchuang Yu (2022, ISBN:9781032233574).
library(ggimage)
library(rphylopic)## You are using rphylopic v.1.4.0. Please remember to credit PhyloPic contributors (hint: `get_attribution()`) and cite rphylopic in your work (hint: `citation("rphylopic")`).
##
## Attaching package: 'rphylopic'
## The following object is masked from 'package:ggimage':
##
## geom_phylopic
## treeio v1.28.0 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
## Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
## for phylogenetic tree input and output with richly annotated and
## associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
## doi: 10.1093/molbev/msz240
##
## Shuangbin Xu, Lin Li, Xiao Luo, Meijun Chen, Wenli Tang, Li Zhan, Zehan
## Dai, Tommy T. Lam, Yi Guan, Guangchuang Yu. Ggtree: A serialized data
## object for visualization of a phylogenetic tree and annotation data.
## iMeta 2022, 1(4):e56. doi:10.1002/imt2.56
##
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## Shuangbin Xu, Lin Li, Xiao Luo, Meijun Chen, Wenli Tang, Li Zhan, Zehan
## Dai, Tommy T. Lam, Yi Guan, Guangchuang Yu. Ggtree: A serialized data
## object for visualization of a phylogenetic tree and annotation data.
## iMeta 2022, 1(4):e56. doi:10.1002/imt2.56
##
## Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods
## for mapping and visualizing associated data on phylogeny using ggtree.
## Molecular Biology and Evolution. 2018, 35(12):3041-3043.
## doi:10.1093/molbev/msy194
##
## Attaching package: 'tidytree'
## The following object is masked from 'package:treeio':
##
## getNodeNum
## The following object is masked from 'package:stats':
##
## filter
##
## Attaching package: 'ape'
## The following objects are masked from 'package:tidytree':
##
## drop.tip, keep.tip
## The following object is masked from 'package:treeio':
##
## drop.tip
## The following object is masked from 'package:ggtree':
##
## rotate
## The following object is masked from 'package:dplyr':
##
## where
##
## Attaching package: 'TreeTools'
## The following object is masked from 'package:tidytree':
##
## MRCA
## The following object is masked from 'package:treeio':
##
## MRCA
## The following object is masked from 'package:ggtree':
##
## MRCA
## Loading required package: maps
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
##
## Attaching package: 'phytools'
## The following object is masked from 'package:TreeTools':
##
## as.multiPhylo
## The following object is masked from 'package:treeio':
##
## read.newick
## ggtreeExtra v1.14.0 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## S Xu, Z Dai, P Guo, X Fu, S Liu, L Zhou, W Tang, T Feng, M Chen, L
## Zhan, T Wu, E Hu, Y Jiang, X Bo, G Yu. ggtreeExtra: Compact
## visualization of richly annotated phylogenetic data. Molecular Biology
## and Evolution. 2021, 38(9):4039-4042. doi: 10.1093/molbev/msab166
tree_arc <- read.tree("data/NEON/gtdbtk.ar53.decorated.tree")
tree_bac <- read.tree("data/NEON/gtdbtk.bac120.decorated.tree")NEON_MAGs_metagenomes_chemistry <- NEON_MAGs %>%
left_join(NEON_metagenomes, by = "Sample Name") %>%
left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID")) %>%
rename("label" = "Bin ID")ggtree(tree_Methyl, layout="circular") %<+%
NEON_MAGs_metagenomes_chemistry +
geom_point2(mapping=aes(color=`Ecosystem Subtype`, size=`Total Number of Bases`))## Warning: Removed 22 rows containing missing values or values outside the scale range
## (`geom_point_g_gtree()`).
Below, the phylogeny of the organisms found at Onaqui are shown. As it seen, the phylogeny is normal as organisms found in the same phylum are close to each other in the phylogeny. However, organisms that are too similar to each other in the phylogeny are found at different ecosystem subtypes
# For unknown reasons the following does not like blank spaces in the names
NEON_MAGs_metagenomes_chemistry_noblank <- NEON_MAGs_metagenomes_chemistry %>%
rename("AssemblyType" = "Assembly Type") %>%
rename("BinCompleteness" = "Bin Completeness") %>%
rename("BinContamination" = "Bin Contamination") %>%
rename("TotalNumberofBases" = "Total Number of Bases") %>%
rename("EcosystemSubtype" = "Ecosystem Subtype")
ggtree(tree_Methyl) %<+%
NEON_MAGs_metagenomes_chemistry +
geom_tippoint(aes(colour=`Ecosystem Subtype`)) +
# For unknown reasons the following does not like blank spaces in the names
geom_facet(panel = "Bin Completeness", data = NEON_MAGs_metagenomes_chemistry_noblank, geom = geom_point,
mapping=aes(x = BinCompleteness)) +
geom_facet(panel = "Bin Contamination", data = NEON_MAGs_metagenomes_chemistry_noblank, geom = geom_col,
aes(x = BinContamination), orientation = 'y', width = .6) +
theme_tree2(legend.position=c(.1, .7))NEON_MAGs_metagenomes_chemistry_ONAQ <- NEON_MAGs_metagenomes_chemistry %>%
filter(`Site ID.x` == "ONAQ")NEON_MAGs_metagenomes_chemistry_ONAQ <- NEON_MAGs_metagenomes_chemistry %>%
filter(`Site ID.x` == "ONAQ") %>%
filter(Domain.x == "Bacteria")tree_bac_ONAQ_MAGs <-drop.tip(tree_bac,tree_bac$tip.label[-match(ONAQ_MAGs_label, tree_bac$tip.label)])ggtree(tree_bac_ONAQ_MAGs, layout="circular") %<+%
NEON_MAGs_metagenomes_chemistry +
geom_point(mapping=aes(color=Phylum)) # Select the GTDB Taxonomic lineage and separate into taxonomic levels
sankey_data <- NEON_MAGs_co %>%
select(`GTDB-Tk Taxonomy Lineage`) %>%
# NAs are likely Archaea
replace_na(list(`GTDB-Tk Taxonomy Lineage` = 'Archaea')) %>%
# Pavian format requires p__ etc
separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), "; ") ## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 615 rows [2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, ...].
# Fill in the NAs with the taxonomic the higher taxonomic level to the left
sankey_data[] <- t(apply(sankey_data, 1, zoo::na.locf))
# Put the data into a format that can be read by the Sankey App
sankey_data <- sankey_data %>%
unite(col = "classification", c(Domain, Phylum, Class, Order, Family, Genus, Species), sep='; ') %>%
mutate_at("classification", str_replace, "Archaea", "d__Archaea") %>%
mutate_at("classification", str_replace, "Bacteria", "d__Bacteria") %>%
mutate_at("classification", str_replace, "; ", "|p__") %>%
mutate_at("classification", str_replace, "; ", "|c__") %>%
mutate_at("classification", str_replace, "; ", "|o__") %>%
mutate_at("classification", str_replace, "; ", "|f__") %>%
mutate_at("classification", str_replace, "; ", "|g__") %>%
mutate_at("classification", str_replace, "; ", "|s__")
# Create format for Pavian with counts for each taxonomic level
sankey_data_s <- sankey_data
sankey_data_g <- sankey_data
sankey_data_f <- sankey_data
sankey_data_o <- sankey_data
sankey_data_c <- sankey_data
sankey_data_p <- sankey_data
sankey_data_d <- sankey_data
sankey_data_g$classification <- sub("\\|s__.*", "", sankey_data_g$classification)
sankey_data_f$classification <- sub("\\|g__.*", "", sankey_data_f$classification)
sankey_data_o$classification <- sub("\\|f__.*", "", sankey_data_o$classification)
sankey_data_c$classification <- sub("\\|o__.*", "", sankey_data_c$classification)
sankey_data_p$classification <- sub("\\|c__.*", "", sankey_data_p$classification)
sankey_data_d$classification <- sub("\\|p__.*", "", sankey_data_d$classification)
sankey_data_allTaxa <- bind_rows(sankey_data_s, sankey_data_g, sankey_data_f, sankey_data_o, sankey_data_c, sankey_data_p, sankey_data_d) %>%
mutate(classification = as.factor(classification)) %>%
count(classification) %>%
# rename for Pavian format
rename(`#SampleID` = `classification`) %>%
rename(`Metaphlan2_Analysis` = `n`)
# Write file to input to Pavian Sankey
write_tsv(sankey_data_allTaxa, "NEON_MAG_co_pavian.txt")NEON_MAGs_ind_Methylo <- NEON_MAGs %>%
filter(Domain == "Bacteria") %>%
filter(`Assembly Type` == "Individual") %>%
filter(Phylum == "Methylomirabilota")# Select the GTDB Taxonomic lineage and separate into taxonomic levels
sankey_data_exer1 <- NEON_MAGs_ind_Methylo %>%
select(`GTDB-Tk Taxonomy Lineage`) %>%
# NAs are likely Archaea
replace_na(list(`GTDB-Tk Taxonomy Lineage` = 'Archaea')) %>%
# Pavian format requires p__ etc
separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), "; ") ## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 16 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16].
# Fill in the NAs with the taxonomic the higher taxonomic level to the left
sankey_data_exer1[] <- t(apply(sankey_data_exer1, 1, zoo::na.locf))
# Put the data into a format that can be read by the Sankey App
sankey_data_exer1 <- sankey_data_exer1 %>%
unite(col = "classification", c(Domain, Phylum, Class, Order, Family, Genus, Species), sep='; ') %>%
mutate_at("classification", str_replace, "Archaea", "d__Archaea") %>%
mutate_at("classification", str_replace, "Bacteria", "d__Bacteria") %>%
mutate_at("classification", str_replace, "; ", "|p__") %>%
mutate_at("classification", str_replace, "; ", "|c__") %>%
mutate_at("classification", str_replace, "; ", "|o__") %>%
mutate_at("classification", str_replace, "; ", "|f__") %>%
mutate_at("classification", str_replace, "; ", "|g__") %>%
mutate_at("classification", str_replace, "; ", "|s__")
# Create format for Pavian with counts for each taxonomic level
sankey_data_exer1_s <- sankey_data_exer1
sankey_data_exer1_g <- sankey_data_exer1
sankey_data_exer1_f <- sankey_data_exer1
sankey_data_exer1_o <- sankey_data_exer1
sankey_data_exer1_c <- sankey_data_exer1
sankey_data_exer1_p <- sankey_data_exer1
sankey_data_exer1_d <- sankey_data_exer1
sankey_data_exer1_g$classification <- sub("\\|s__.*", "", sankey_data_exer1_g$classification)
sankey_data_exer1_f$classification <- sub("\\|g__.*", "", sankey_data_exer1_f$classification)
sankey_data_exer1_o$classification <- sub("\\|f__.*", "", sankey_data_exer1_o$classification)
sankey_data_exer1_c$classification <- sub("\\|o__.*", "", sankey_data_exer1_c$classification)
sankey_data_exer1_p$classification <- sub("\\|c__.*", "", sankey_data_exer1_p$classification)
sankey_data_exer1_d$classification <- sub("\\|p__.*", "", sankey_data_exer1_d$classification)
sankey_data_exer1_allTaxa <- bind_rows(sankey_data_exer1_s, sankey_data_exer1_g, sankey_data_exer1_f, sankey_data_exer1_o, sankey_data_exer1_c, sankey_data_exer1_p, sankey_data_exer1_d) %>%
mutate(classification = as.factor(classification)) %>%
count(classification) %>%
# rename for Pavian format
rename(`#SampleID` = `classification`) %>%
rename(`Metaphlan2_Analysis` = `n`)
# Write file to input to Pavian Sankey
write_tsv(sankey_data_exer1_allTaxa, "NEON_MAG_exer1_pavian.txt")Here, we have the Sankey plot for the individual assemblies for Methylomirabilota at all sites.
NEON_MAGs_com_Methylo <- NEON_MAGs %>%
filter(Domain == "Bacteria") %>%
filter(`Assembly Type` == "Combined") %>%
filter(Phylum == "Methylomirabilota")# Select the GTDB Taxonomic lineage and separate into taxonomic levels
sankey_data_exer2 <- NEON_MAGs_com_Methylo %>%
select(`GTDB-Tk Taxonomy Lineage`) %>%
# NAs are likely Archaea
replace_na(list(`GTDB-Tk Taxonomy Lineage` = 'Archaea')) %>%
# Pavian format requires p__ etc
separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), "; ") ## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 7 rows [1, 2, 3, 4, 5, 6,
## 7].
# Fill in the NAs with the taxonomic the higher taxonomic level to the left
sankey_data_exer2[] <- t(apply(sankey_data_exer2, 1, zoo::na.locf))
# Put the data into a format that can be read by the Sankey App
sankey_data_exer2 <- sankey_data_exer2 %>%
unite(col = "classification", c(Domain, Phylum, Class, Order, Family, Genus, Species), sep='; ') %>%
mutate_at("classification", str_replace, "Archaea", "d__Archaea") %>%
mutate_at("classification", str_replace, "Bacteria", "d__Bacteria") %>%
mutate_at("classification", str_replace, "; ", "|p__") %>%
mutate_at("classification", str_replace, "; ", "|c__") %>%
mutate_at("classification", str_replace, "; ", "|o__") %>%
mutate_at("classification", str_replace, "; ", "|f__") %>%
mutate_at("classification", str_replace, "; ", "|g__") %>%
mutate_at("classification", str_replace, "; ", "|s__")
# Create format for Pavian with counts for each taxonomic level
sankey_data_exer2_s <- sankey_data_exer2
sankey_data_exer2_g <- sankey_data_exer2
sankey_data_exer2_f <- sankey_data_exer2
sankey_data_exer2_o <- sankey_data_exer2
sankey_data_exer2_c <- sankey_data_exer2
sankey_data_exer2_p <- sankey_data_exer2
sankey_data_exer2_d <- sankey_data_exer2
sankey_data_exer2_g$classification <- sub("\\|s__.*", "", sankey_data_exer2_g$classification)
sankey_data_exer2_f$classification <- sub("\\|g__.*", "", sankey_data_exer2_f$classification)
sankey_data_exer2_o$classification <- sub("\\|f__.*", "", sankey_data_exer2_o$classification)
sankey_data_exer2_c$classification <- sub("\\|o__.*", "", sankey_data_exer2_c$classification)
sankey_data_exer2_p$classification <- sub("\\|c__.*", "", sankey_data_exer2_p$classification)
sankey_data_exer2_d$classification <- sub("\\|p__.*", "", sankey_data_exer2_d$classification)
sankey_data_exer2_allTaxa <- bind_rows(sankey_data_exer2_s, sankey_data_exer2_g, sankey_data_exer2_f, sankey_data_exer2_o, sankey_data_exer2_c, sankey_data_exer2_p, sankey_data_exer2_d) %>%
mutate(classification = as.factor(classification)) %>%
count(classification) %>%
# rename for Pavian format
rename(`#SampleID` = `classification`) %>%
rename(`Metaphlan2_Analysis` = `n`)
# Write file to input to Pavian Sankey
write_tsv(sankey_data_exer2_allTaxa, "NEON_MAG_exer2_pavian.txt")Here, we have the Sankey plot for the combined assemblies in the Methylomirabilota phylum.
# Select the GTDB Taxonomic lineage and separate into taxonomic levels
sankey_data_exer3 <- NEON_MAGs_ONAQ %>%
select(`GTDB-Tk Taxonomy Lineage`) %>%
# NAs are likely Archaea
replace_na(list(`GTDB-Tk Taxonomy Lineage` = 'Archaea')) %>%
# Pavian format requires p__ etc
separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), "; ") ## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 57 rows [1, 5, 6, 7, 8,
## 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, ...].
# Fill in the NAs with the taxonomic the higher taxonomic level to the left
sankey_data_exer3[] <- t(apply(sankey_data_exer3, 1, zoo::na.locf))
# Put the data into a format that can be read by the Sankey App
sankey_data_exer3 <- sankey_data_exer3 %>%
unite(col = "classification", c(Domain, Phylum, Class, Order, Family, Genus, Species), sep='; ') %>%
mutate_at("classification", str_replace, "Archaea", "d__Archaea") %>%
mutate_at("classification", str_replace, "Bacteria", "d__Bacteria") %>%
mutate_at("classification", str_replace, "; ", "|p__") %>%
mutate_at("classification", str_replace, "; ", "|c__") %>%
mutate_at("classification", str_replace, "; ", "|o__") %>%
mutate_at("classification", str_replace, "; ", "|f__") %>%
mutate_at("classification", str_replace, "; ", "|g__") %>%
mutate_at("classification", str_replace, "; ", "|s__")
# Create format for Pavian with counts for each taxonomic level
sankey_data_exer3_s <- sankey_data_exer3
sankey_data_exer3_g <- sankey_data_exer3
sankey_data_exer3_f <- sankey_data_exer3
sankey_data_exer3_o <- sankey_data_exer3
sankey_data_exer3_c <- sankey_data_exer3
sankey_data_exer3_p <- sankey_data_exer3
sankey_data_exer3_d <- sankey_data_exer3
sankey_data_exer3_g$classification <- sub("\\|s__.*", "", sankey_data_exer3_g$classification)
sankey_data_exer3_f$classification <- sub("\\|g__.*", "", sankey_data_exer3_f$classification)
sankey_data_exer3_o$classification <- sub("\\|f__.*", "", sankey_data_exer3_o$classification)
sankey_data_exer3_c$classification <- sub("\\|o__.*", "", sankey_data_exer3_c$classification)
sankey_data_exer3_p$classification <- sub("\\|c__.*", "", sankey_data_exer3_p$classification)
sankey_data_exer3_d$classification <- sub("\\|p__.*", "", sankey_data_exer3_d$classification)
sankey_data_exer3_allTaxa <- bind_rows(sankey_data_exer3_s, sankey_data_exer3_g, sankey_data_exer3_f, sankey_data_exer3_o, sankey_data_exer3_c, sankey_data_exer3_p, sankey_data_exer3_d) %>%
mutate(classification = as.factor(classification)) %>%
count(classification) %>%
# rename for Pavian format
rename(`#SampleID` = `classification`) %>%
rename(`Metaphlan2_Analysis` = `n`)
# Write file to input to Pavian Sankey
write_tsv(sankey_data_exer3_allTaxa, "NEON_MAG_exer3_pavian.txt")Here, we have the Sankey plot for the Onaqui Site.
Our site, Onaqui basin, Utah showed to have environmental properties that put it in the middle range of soil temperature. When it came to pH, Onaqui was more alkaline than the other sites. This aligns with shrubland conditions which generally have higher pH and low nutrient content. In terms of taxonomic breakdown at the site, a vast majority of bacteria fell into the actinobacteria phylum. Considering this was the most common phylum across the phylums in the NEON data, this data makes sense.
Our Phylum, Methylomirbilota, is less common than other phylums. The phylum stuck to a small range of temperatures and pH. Taxonomically, there were few classes and orders. From the data it would appear to be a small phylum. Amongst specific data points, one data point stood out as interesting. Typically the phylum would appear in grasslands, shrubland or temperate forest, one data point was found in wetlands. One interesting finding was that the most of the collections that were done on organisms in the phylum were found at elevations below 1000 meters while the only one was found at an elevation near 3000 meters, which was the wetlands collection. This data point was genetically distant from the other data points, probably resulting in its ability to survive in an environment this phylum wouldn’t be expected to be found at.